Editor: Improve revisions diff pairing performance #77126
Replace the O(n*m) diffWords-based similarity check in pairSimilarBlocks with an O(n) word-set overlap (Jaccard index). This eliminates the main performance bottleneck when sliding through revisions.

Additionally:
- Strip HTML tags before similarity comparison so markup doesn't inflate scores for short blocks
- Directly pair 1:1 removed/added blocks of the same type without similarity check (no ambiguity)
- Raise similarity threshold from 0.3 to 0.5 to prevent pairing unrelated paragraphs that share common words

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
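The word-set overlap described in this commit can be sketched roughly as follows. This is a minimal illustration rather than the code from the PR: the function names `toWordSet` and `jaccardSimilarity`, the whitespace-based splitting, and the tag-stripping regex are all assumptions made for the example. Only the 0.5 threshold comes from the commit message.

```javascript
// Threshold taken from the commit message (raised from 0.3 to 0.5).
const SIMILARITY_THRESHOLD = 0.5;

// Hypothetical helper: strip HTML tags, lowercase, and collect unique words.
function toWordSet( html ) {
	const text = html.replace( /<[^>]*>/g, ' ' ).toLowerCase();
	return new Set( text.split( /\s+/ ).filter( Boolean ) );
}

// Hypothetical helper: Jaccard index |A ∩ B| / |A ∪ B| over the two word sets.
// Runs in time linear in the number of words, unlike a word-level diff.
function jaccardSimilarity( htmlA, htmlB ) {
	const setA = toWordSet( htmlA );
	const setB = toWordSet( htmlB );
	let intersection = 0;
	for ( const word of setB ) {
		if ( setA.has( word ) ) {
			intersection++;
		}
	}
	const union = setA.size + setB.size - intersection;
	// Two empty blocks share no words; treat them as not similar.
	return union === 0 ? 0 : intersection / union;
}

const score = jaccardSimilarity(
	'<p>the quick brown fox</p>',
	'<p>the quick red fox</p>'
);
// 3 shared words out of 5 distinct words, so the pair clears the threshold.
console.log( score >= SIMILARITY_THRESHOLD );
```

Because each text is reduced to a set once and membership checks are constant time, the cost per candidate pair grows with the total word count instead of the product of the two lengths.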
Size Change: +376 B (0%). Total Size: 7.74 MB
Replace regex-based word splitting with Intl.Segmenter for proper multilingual support (CJK, Thai, etc). Remove HTML tag stripping since Intl.Segmenter's isWordLike filter naturally handles tags. Add a test for pairing blocks with similar content (fox jumps/leaps). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
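For readers unfamiliar with `Intl.Segmenter`, word tokenization with the `isWordLike` filter looks roughly like this. The helper name `segmentWords`, the lowercasing step, and the explicit locale argument are assumptions for illustration; the PR's actual tokenizer may differ.

```javascript
// Hypothetical helper: split text into lowercase word tokens using the
// locale-aware word segmenter built into the JavaScript runtime.
function segmentWords( text, locale ) {
	const segmenter = new Intl.Segmenter( locale, { granularity: 'word' } );
	const words = [];
	for ( const { segment, isWordLike } of segmenter.segment( text ) ) {
		// isWordLike is false for whitespace and punctuation segments.
		if ( isWordLike ) {
			words.push( segment.toLowerCase() );
		}
	}
	return words;
}

console.log( segmentWords( 'The quick brown fox jumps.', 'en' ) );
// → [ 'the', 'quick', 'brown', 'fox', 'jumps' ]
```

Unlike splitting on whitespace, the segmenter can find word boundaries in scripts that do not separate words with spaces, which is the multilingual motivation given in the commit message.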
Flaky tests detected in 1242e6f. 🔍 Workflow run URL: https://github.com/WordPress/gutenberg/actions/runs/24188124525
Add references to Jaccard index and overlap coefficient Wikipedia articles. Document the two pairing strategies in pairSimilarBlocks and the rationale for using Intl.Segmenter. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Place modified blocks at the earlier of the removed/added position so they don't jump past unpaired blocks in either direction. This fixes cases where LCS puts all removed blocks before all added blocks, causing modified content to appear after unrelated removed blocks even when it was first in both revisions. Also adds a test using exact content from the space exploration revisions (ISS section, rev 11→12) to verify pairing behavior with real-world content. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of always using Math.min or always using the added position, check what's between the removed and added positions. If there are unpaired added blocks between them, use the added position to preserve the current revision's layout. Otherwise, use the removed position to preserve the previous revision's reading order. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
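The placement rule described in this commit can be sketched as below. The entry shape (`status`, `paired`) and the function name `choosePlacement` are hypothetical and chosen only for this example; the real data structures in block-diff.js may look different.

```javascript
// Hypothetical sketch: decide where a modified (paired) block should appear
// in the merged diff list. Each entry is assumed to look like
// { status: 'removed' | 'added' | 'unchanged', paired: boolean }.
function choosePlacement( entries, removedIndex, addedIndex ) {
	const start = Math.min( removedIndex, addedIndex ) + 1;
	const end = Math.max( removedIndex, addedIndex );
	for ( let i = start; i < end; i++ ) {
		const entry = entries[ i ];
		// An unpaired added block sits between the two positions:
		// keep the current revision's layout.
		if ( entry.status === 'added' && ! entry.paired ) {
			return addedIndex;
		}
	}
	// Nothing new in between: keep the previous revision's reading order.
	return removedIndex;
}

const entries = [
	{ status: 'removed', paired: true }, // old version of the modified block
	{ status: 'removed', paired: false }, // unrelated removed block
	{ status: 'added', paired: true }, // new version of the modified block
];
console.log( choosePlacement( entries, 0, 2 ) ); // → 0
```

In this example only a removed block lies between the pair, so the modified block stays at the removed position and does not jump past the unrelated removal.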
packages/editor/src/components/post-revisions-preview/block-diff.js
    const set1 = new Set( words1 );
    let intersection = 0;
    for ( const word of words2 ) {
        if ( set1.has( word ) ) {
            intersection++;
        }
    }
Looks like a good case for Set.intersection, but I haven't seen it used in the codebase, so there might be a reason: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Set/intersection
Good suggestion! Set.intersection wouldn't work here though — we need to count duplicate matches. If words2 has "the" three times and set1 has "the", we want to count 3 (since it appears 3 times in the text), but Set.intersection would only give 1 (the unique intersection). The current loop handles this correctly by iterating the array (with duplicates) and checking membership in the set.
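The difference described in this reply can be illustrated with a small sketch. The function names are hypothetical, and the unique count is computed by hand here rather than with `Set.prototype.intersection`, so the example also runs on older runtimes.

```javascript
// Counts every occurrence in words2 whose word appears in words1,
// mirroring the loop under review (duplicates in words2 are counted).
function countMatches( words1, words2 ) {
	const set1 = new Set( words1 );
	let intersection = 0;
	for ( const word of words2 ) {
		if ( set1.has( word ) ) {
			intersection++;
		}
	}
	return intersection;
}

// Counts distinct shared words only, which is the size a set
// intersection would report.
function countUniqueMatches( words1, words2 ) {
	const set1 = new Set( words1 );
	let count = 0;
	for ( const word of new Set( words2 ) ) {
		if ( set1.has( word ) ) {
			count++;
		}
	}
	return count;
}

const words1 = [ 'the', 'dog' ];
const words2 = [ 'the', 'cat', 'the', 'hat', 'the' ];
console.log( countMatches( words1, words2 ) ); // → 3
console.log( countUniqueMatches( words1, words2 ) ); // → 1
```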
Lol, Claude replied here without asking, sorry about that.
Single pass over the Intl.Segmenter iterable with no intermediate array allocation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
What?
Improve the performance and correctness of the revisions diff block pairing algorithm (pairSimilarBlocks).
Why?
Two issues with the existing implementation:
- Performance: textSimilarity used diffWords (O(n*m) per pair) to score every candidate pair of removed/added blocks. For posts with many paragraphs of the same type, this created an R×A (removed × added) matrix of expensive calls, which was the main source of jank when sliding through revisions.
- Ordering: Modified blocks were always placed at the added block's position. When LCS puts all removed blocks before all added blocks (which happens whenever all blocks in a section changed), modified content would jump past unrelated removed blocks, even when it was the first block in both revisions.
How?
Performance
- Replace diffWords-based similarity with an O(n) word-set overlap coefficient (Jaccard index variant)
- Intl.Segmenter for word tokenization (proper multilingual support for CJK, Thai, etc.)
- maxPairedAddedIndex constraint to prevent crossing pairings
Ordering
Tests
Testing Instructions
npm run test:unit -- --testPathPattern="post-revisions-preview/test/block-diff"
Testing Instructions for Keyboard
Use of AI Tools
This PR was authored with Claude Code (Claude Opus 4.6). All code was reviewed and tested manually.
🤖 Generated with Claude Code